These slides are inspired by:
"Creating effective figures and tables" by Karl Broman. Code available on GitHub repository. Introduction to Data Visualization course by Peter Aldhous
This material also appears as a data visualization chapter in this book
17-3-7
To illustrate how some of these strategies compare, let's suppose we want to report the results from two hypothetical polls on browser preference, taken in 2000 and 2015.
Here, for each year, we are simply comparing four quantities, four percentages.
A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:
Here we are representing quantities with both areas and angles, since both the angle and the area of each pie slice are proportional to the quantity it represents.
This turns out to be a suboptimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when only area is available.
Pie chart of browser usage.
To see how hard it is to quantify angles and area, note that the rankings and all the percentages in the plots above changed from 2000 to 2015.
Can you determine the actual percentages and rank the browsers' popularity? Can you see how the percentages changed from 2000 to 2015? It is not easy to tell from the plot.
| Browser | 2000 | 2015 |
|---|---|---|
| Opera | 3 | 2 |
| Safari | 21 | 22 |
| Firefox | 23 | 21 |
| Chrome | 26 | 29 |
| IE | 28 | 27 |
The preferred way to plot quantities is to use length and position, since humans are much better at judging linear measures.
The barplot takes this approach by using bars of length proportional to the quantities of interest.
By adding horizontal lines at strategically chosen values, in this case at every multiple of 10, we make it easier to read off each quantity from the position of the top of its bar.
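A barplot along these lines can be sketched with ggplot2, using the percentages from the table above. This is a minimal sketch, not necessarily the code behind the original figure; the data frame name `browsers` and its column names are assumptions made for illustration.

```r
library(ggplot2)

# Poll percentages copied from the table above, stored in long format
# (the name `browsers` and its column names are illustrative assumptions)
browsers <- data.frame(
  browser = rep(c("Opera", "Safari", "Firefox", "Chrome", "IE"), times = 2),
  year = rep(c(2000, 2015), each = 5),
  percentage = c(3, 21, 23, 26, 28, 2, 22, 21, 29, 27)
)

p <- ggplot(browsers, aes(x = browser, y = percentage)) +
  geom_col() +                                   # bar length encodes the percentage, starting at 0
  scale_y_continuous(breaks = seq(0, 30, 10)) +  # horizontal reference lines at multiples of 10
  facet_wrap(~year) +                            # one panel per poll year
  theme(panel.grid.minor = element_blank())      # keep only the lines at multiples of 10

print(p)
```

Because `geom_col()` draws every bar from 0, the bar lengths stay proportional to the percentages.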
Position and length are the preferred ways to display quantities. Both are preferred over angles, which in turn are preferred over area.
Brightness and color are even harder to quantify than angles and area but, as we will see later, they are sometimes useful when more than two dimensions are being displayed.
When using length (e.g. barplots) it is misleading not to start the bars at 0.
This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed.
By avoiding 0, relatively small differences can be made to look much bigger than they actually are.
This approach is often used by politicians or media organizations trying to exaggerate a difference.
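The size of this distortion can be worked out directly. The sketch below uses the 2015 Chrome and IE percentages from the browser poll; the helper `apparent_ratio()` and the baseline of 25 are hypothetical, chosen only to illustrate the effect.

```r
# Ratio of two drawn bar lengths when the axis starts at `baseline`
# (hypothetical helper, for illustration only)
apparent_ratio <- function(a, b, baseline = 0) {
  (a - baseline) / (b - baseline)
}

chrome <- 29  # 2015 percentage from the browser poll
ie <- 27      # 2015 percentage from the browser poll

apparent_ratio(chrome, ie)                 # axis starts at 0: about 1.07
apparent_ratio(chrome, ie, baseline = 25)  # axis starts at 25: 2
```

With the axis starting at 25, the Chrome bar is drawn twice as long as the IE bar even though the underlying percentages differ by only two points.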
When using position rather than length, it is not necessary to include 0.
This is particularly the case when we want to compare differences between groups relative to the variability seen within the groups.
(Source: President Barack Obama’s 2011 State of the Union Address via [Peter Aldhous](http://paldhous.github.io/ucb/2016/dataviz/index.html))
To motivate this principle, let's assume an extraterrestrial is interested in the difference in heights between males and females.
A commonly seen plot used for comparisons between groups, popularized by software such as Microsoft Excel, shows the average and standard errors (standard errors are defined in a later chapter, but don't confuse them with the standard deviation of the data).
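To make the distinction concrete, here is a minimal sketch of how the averages and standard errors behind such a plot could be computed. The heights are simulated, not the data used in these slides: the group means (65 and 69 inches), the standard deviation (3.5) and the group size (200) are all assumptions.

```r
set.seed(1)
n <- 200  # assumed number of individuals per group

# Simulated heights in inches (assumed values, not the data from the slides)
heights <- data.frame(
  sex = rep(c("Female", "Male"), each = n),
  height = c(rnorm(n, mean = 65, sd = 3.5), rnorm(n, mean = 69, sd = 3.5))
)

group_sd <- tapply(heights$height, heights$sex, sd)
group_n <- tapply(heights$height, heights$sex, length)

summary_stats <- data.frame(
  sex = names(group_sd),
  average = as.numeric(tapply(heights$height, heights$sex, mean)),
  sd = as.numeric(group_sd),                  # spread of the individual heights
  se = as.numeric(group_sd / sqrt(group_n))   # uncertainty of the group average
)

print(summary_stats)
```

The standard error shrinks as the group size grows while the standard deviation does not, which is why error bars based on standard errors say little about how individual heights are distributed.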
What do we learn? On average, males are taller than females.
We also note dark horizontal lines, showing that many reported values are rounded to the nearest integer.
Since there are so many points, it is more effective to show distributions rather than individual points.
This plot makes it much easier to notice that men are, on average, taller.
If instead of histograms we want the more compact summary provided by boxplots, then we align them horizontally since, by default, boxplots move up and down with changes in height.
grid.arrange(p1, p2, p3, ncol = 3)
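The alignment choices described above can be sketched with ggplot2. The heights are simulated under assumed means and spread (they are not the data from the slides), and the object names `p_hist` and `p_box` are illustrative.

```r
library(ggplot2)

set.seed(1)
# Simulated heights in inches (assumed values, not the data from the slides)
heights <- data.frame(
  sex = rep(c("Female", "Male"), each = 200),
  height = c(rnorm(200, mean = 65, sd = 3.5), rnorm(200, mean = 69, sd = 3.5))
)

# Histograms: stack the panels vertically so both groups share one horizontal axis
p_hist <- ggplot(heights, aes(x = height)) +
  geom_histogram(binwidth = 1, color = "black") +
  facet_grid(sex ~ .)

# Boxplots: place the groups side by side so both share one vertical axis
p_box <- ggplot(heights, aes(x = sex, y = height)) +
  geom_boxplot()

print(p_hist)
print(p_box)
```

In both cases the axis that encodes height is common to the two groups, so a shift in the distribution shows up as a shift along a shared scale.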
Visual cues to be compared should be adjacent
When comparing income data between 1970 and 2010 across regions, we made a figure similar to the one below.
A difference is that here we look at continents instead of regions, but this is not relevant to the point we are making.
Exceptions are
Slope charts
Bland-Altman plots
When choosing colors to quantify a numeric variable, we choose between two options: sequential and diverging.
Here are some examples offered by the package RColorBrewer:
library(RColorBrewer)
display.brewer.all(type="div")
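As a complement, the sketch below pulls one palette of each kind from RColorBrewer. Sequential palettes suit values that run from low to high, while diverging palettes suit values that spread in two directions from a meaningful midpoint. The choice of `"Blues"` and `"RdBu"` is an assumption made for illustration.

```r
library(RColorBrewer)

# Sequential palette: light to dark shades of a single hue
sequential <- brewer.pal(9, "Blues")

# Diverging palette: two hues that meet at a neutral midpoint
diverging <- brewer.pal(11, "RdBu")

# Overview of all sequential palettes, the counterpart of type = "div"
display.brewer.all(type = "seq")
```

Both calls return character vectors of hex color codes that can be passed to a plotting function's color scale.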
(Source: Karl Broman)

| state | year | Measles | Pertussis | Polio |
|---|---|---|---|---|
| California | 1940 | 37.8826320 | 18.3397861 | 18.3397861 |
| California | 1950 | 13.9124205 | 4.7467350 | 4.7467350 |
| California | 1960 | 14.1386471 | 0.0000000 | 0.0000000 |
| California | 1970 | 0.9767889 | 0.0000000 | 0.0000000 |
| California | 1980 | 0.3743467 | 0.0515466 | 0.0515466 |
| state | year | Measles | Pertussis | Polio |
|---|---|---|---|---|
| California | 1940 | 37.9 | 18.3 | 18.3 |
| California | 1950 | 13.9 | 4.7 | 4.7 |
| California | 1960 | 14.1 | 0.0 | 0.0 |
| California | 1970 | 1.0 | 0.0 | 0.0 |
| California | 1980 | 0.4 | 0.1 | 0.1 |
| state | disease | 1940 | 1950 | 1960 | 1970 | 1980 |
|---|---|---|---|---|---|---|
| California | Measles | 37.9 | 13.9 | 14.1 | 1 | 0.4 |
| California | Pertussis | 18.3 | 4.7 | 0.0 | 0 | 0.1 |
| California | Polio | 18.3 | 4.7 | 0.0 | 0 | 0.1 |
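The progression across these tables (raw rates, rounded rates, then one column per year) can be reproduced with dplyr and tidyr. This is a sketch under assumptions rather than the code that generated the tables: the data frame name `rates` is illustrative, and `across()` and `pivot_wider()` require dplyr 1.0 and tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Values copied from the first table above (the name `rates` is an assumption)
rates <- data.frame(
  state = "California",
  year = c(1940, 1950, 1960, 1970, 1980),
  Measles = c(37.8826320, 13.9124205, 14.1386471, 0.9767889, 0.3743467),
  Pertussis = c(18.3397861, 4.7467350, 0, 0, 0.0515466),
  Polio = c(18.3397861, 4.7467350, 0, 0, 0.0515466)
)

# Step 1: keep only one decimal place
rounded <- rates %>%
  mutate(across(c(Measles, Pertussis, Polio), ~ round(.x, 1)))

# Step 2: one row per disease, one column per year
wide <- rounded %>%
  pivot_longer(c(Measles, Pertussis, Polio),
               names_to = "disease", values_to = "rate") %>%
  pivot_wider(names_from = year, values_from = rate)

print(wide)
```

Placing the years in adjacent columns puts the numbers to be compared, the rates of one disease over time, next to each other in a single row.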